Abstract: Deduplication is the process of finding duplicate records when comparing data against one or more databases or data sets. Matching records across several databases is known as record linkage. The matched data, which is the output of the whole deduplication process, contains important and usable information. Because this information is costly to acquire, deduplication is receiving more attention day by day. Removing duplicate records within a single database is a critical step in data cleaning, because duplicates can greatly influence the outcomes of subsequent data processing or data mining. As databases grow, the complexity of the matching process has become one of the major challenges for record linkage and deduplication. To address this to some extent, we propose a Two Stage Sampling Selection (T3S) model in this article. T3S has two stages. The first stage produces balanced subsets of candidate pairs to be labeled. The second stage incrementally invokes an active selection that removes redundant pairs created in the first stage, yielding training sets that are smaller and more informative than those of the first stage. We extend this work in the classification phase by using a more advanced classification approach, the AdaBoost algorithm. Several studies report that AdaBoost gives better accuracy than the SVM classifier. Our experimental results on a real-world dataset present a comparative analysis of both methods and show that the proposed method performs better than SVM.

Keywords: Deduplication, T3S, AdaBoost.
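
The two stages summarized in the abstract can be illustrated with a minimal sketch. It is not the implementation evaluated in this article. It assumes that each candidate pair carries a similarity score in [0, 1], that "balanced" means drawing the same number of pairs from each similarity level, and that a pair is "redundant" when its similarity lies within a small gap of a pair that is already labeled. The function names, the parameters `n_levels`, `per_level`, and `min_gap`, and the toy oracle are all hypothetical.

```python
import random


def stage_one_balanced_sample(pairs, n_levels=10, per_level=5, seed=0):
    """Stage 1 (sketch): bucket candidate pairs by similarity level and
    draw the same number of pairs from each non-empty level."""
    rng = random.Random(seed)
    levels = [[] for _ in range(n_levels)]
    for pair_id, sim in pairs:
        # Clamp so that sim == 1.0 falls into the last level.
        idx = min(int(sim * n_levels), n_levels - 1)
        levels[idx].append((pair_id, sim))
    sample = []
    for bucket in levels:
        k = min(per_level, len(bucket))
        sample.extend(rng.sample(bucket, k))
    return sample


def stage_two_active_selection(sample, oracle, min_gap=0.05):
    """Stage 2 (sketch): walk the sample in similarity order and query the
    oracle only for pairs that are not redundant. Redundancy is an assumed
    heuristic here: similarity within min_gap of the last labeled pair."""
    labeled = []
    last_sim = None
    for pair_id, sim in sorted(sample, key=lambda p: p[1]):
        if last_sim is not None and sim - last_sim < min_gap:
            continue  # too close to an already labeled pair, skip it
        labeled.append((pair_id, sim, oracle(pair_id)))
        last_sim = sim
    return labeled


# Toy data: 100 candidate pairs with evenly spread similarity scores.
pairs = [(i, i / 100) for i in range(100)]


def toy_oracle(pair_id):
    # Hypothetical stand-in for a human labeler.
    return pair_id >= 50


sample = stage_one_balanced_sample(pairs)
labeled = stage_two_active_selection(sample, toy_oracle)
print(len(pairs), len(sample), len(labeled))
```

The labeled triples returned by the second stage would then serve as the training set for the classifier (SVM or AdaBoost) in the classification phase.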